Windows Azure: Building a Secure Backup System (part 3)

10/22/2010 9:19:14 AM

4. Protecting Data at Rest

Now that the transfer of data over the wire is secure, the next step is to secure the data when it has reached Microsoft’s servers. A reasonable question is, why bother? SSL protects against anyone snooping or modifying traffic over the wire. Microsoft implements various security practices (from physical to technological) to protect your data once it has reached its data centers. Isn’t this sufficient security?

For most cases, the answer is “yes.” This isn’t sufficient only when you have highly sensitive data (health data, for example) that has regulations and laws surrounding how it is stored. Though you may be secure enough in practice, you might still need to encrypt your data to comply with some regulation.

Once you’ve decided that having your data in the clear in Microsoft’s data centers isn’t acceptable (and you’ve taken into account the performance overhead of encrypting it), what do you do?

When Crypto Attacks

Cryptography is very dangerous. It is the technological equivalent of getting someone drunk on tequila and then giving him a loaded bazooka that can fire forward and backward. It (cryptography, not the fictitious bazooka) is fiendishly difficult to get right, and most blog posts/book samples get it wrong in some small, but nevertheless devastating, way.

The only known ways to ensure that an application is cryptographically sound are to do a thorough analysis of the cryptographic techniques it uses, and to let experts review/attack the application for a long time (sometimes years on end). Some of the widespread cryptographic products (be it the Windows crypto stack or OpenSSL) are so good because of the attention and analysis they’ve received.

With this in mind, you should reuse some well-known code/library for all your cryptographic needs. For example, GPGME is a great library for encrypting files. If you’re rolling your own implementation, ensure that you have professional cryptographers validate what you’re doing.

The code and techniques shown in this chapter should be sound. You can use them as a starting point for your own implementation, or to help you understand how other implementations work. However, you shouldn’t trust and reuse the code presented here directly in a production application for the simple reason that it hasn’t undergone thorough scrutiny from a legion of experts.

The goal here will be to achieve two things with any data you back up with the service presented in this chapter. The first is to encrypt data so that only the original user can decrypt it. The second is to digitally sign the encrypted data so that any modification is automatically detected.

4.1. Understanding the Basics of Cryptography

To thoroughly understand how this will be accomplished, you must first become familiar with some basics of cryptography. Entire books have been written on cryptography, so this is nothing more than the most fleeting of introductions, meant more to jog your memory than anything else.

Note: If you’ve never heard of these terms, you should spend a leisurely evening (or two…or several) reading up on them before writing any cryptography code. Unlike a lot of programming where a coder can explore and copy/paste code from the Web and get away with it, cryptography and security are places where not having a solid understanding of the fundamentals can bite you when you least expect it. To quote the old G.I. Joe advertising slogan, “Knowing is half the battle.” The other half is probably reusing other people’s tried-and-tested crypto code whenever you can.

4.1.1. Encryption/decryption

When the term encryption is used in this chapter, it refers to the process of converting data (plaintext) using an algorithm into a form (ciphertext) in which it is unreadable without possession of a key. Decryption is the reverse of this operation, in which a key is used to convert ciphertext back into plaintext.

4.1.2. Symmetric key algorithms

A symmetric key algorithm is one that uses the same key for both encryption and decryption. Popular examples are the Advanced Encryption Standard (AES, also known as Rijndael), Twofish, Serpent, and Blowfish. A major advantage of using symmetric algorithms is that they’re quite fast. However, this gets tempered with the disadvantage that both parties (the one doing the encryption and the one doing the decryption) need to know the same key.

4.1.3. Asymmetric key algorithms (public key cryptography)

An asymmetric key algorithm is one in which the key used for encryption is different from the one used for decryption. The major advantage is, of course, that the party doing the encryption doesn’t need to have access to the same key as the party doing the decryption.

Typically, each user has a pair of cryptographic keys: a public key and a private key. The public key may be widely distributed, but the private key is kept secret. Though the keys are related mathematically, the security of these algorithms depends on the fact that by knowing only one key, it is impossible (or at least infeasible) to derive the other.

Messages are encrypted with the recipient’s public key, and can be decrypted only with the associated private key. You can use this process in reverse to digitally sign data. The sender encrypts a hash of the data with his private key, and the recipient can decrypt the hash using the public key, and verify whether it matches a hash the recipient computes.

The major disadvantage of public key cryptography is that it is typically highly computationally intensive, and it is often impractical to encrypt large amounts of data this way. One common cryptographic technique is to use a symmetric key for quickly encrypting data, and then encrypting the symmetric key (which is quite small) with an asymmetric key algorithm. Popular asymmetric key algorithms include RSA, ElGamal, and others.
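To make the roles of the two keys concrete, here is a toy sketch of textbook RSA in Python. Everything in it is an assumption made for illustration: the tiny primes (61 and 53), the function names, and the lack of any padding scheme. It is deliberately insecure and only shows which key is used for which operation: the public key encrypts and verifies, and the private key decrypts and signs.

```python
import hashlib

# Toy key material: tiny primes chosen for illustration only (insecure).
P, Q = 61, 53
N = P * Q                  # modulus, shared by the public and private key
PHI = (P - 1) * (Q - 1)
E = 17                     # public exponent
D = 2753                   # private exponent, chosen so (E * D) % PHI == 1


def encrypt(message_int, public=(E, N)):
    """Encrypt a small integer with the recipient's public key."""
    e, n = public
    return pow(message_int, e, n)


def decrypt(cipher_int, private=(D, N)):
    """Recover the integer with the matching private key."""
    d, n = private
    return pow(cipher_int, d, n)


def toy_digest(data, n=N):
    """Hash the data, then reduce the digest so it fits the toy modulus."""
    return int(hashlib.sha256(data).hexdigest(), 16) % n


def sign(data, private=(D, N)):
    """Sign by applying the private key to a hash of the data."""
    d, n = private
    return pow(toy_digest(data, n), d, n)


def verify(data, signature, public=(E, N)):
    """Undo the signature with the public key and compare the hashes."""
    e, n = public
    return pow(signature, e, n) == toy_digest(data, n)


if __name__ == "__main__":
    secret = 65
    print(decrypt(encrypt(secret)) == secret)
    print(verify(b"archive bytes", sign(b"archive bytes")))
```

Real implementations use keys thousands of bits long and a padding scheme, which is one more reason to rely on a vetted library rather than code like this.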

4.1.4. Cryptographic hash

A cryptographic hash function is one that takes an arbitrary block of data and returns a fixed-size sequence of bytes. This sounds just like a normal hash function such as the one you would use in a HashTable, correct? Not quite.

To be considered a cryptographically strong hash function, the algorithm must have a few key properties. It should be infeasible to find two messages with the same hash, or to change a message without changing its hash, or to determine contents of the message given its hash. Several of these algorithms are in wide use today. As of this writing, the current state-of-the-art algorithms are those in the SHA-2 family, and older algorithms such as MD5 and SHA-1 should be considered insecure.
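As a small illustration of the fixed-size property, the following sketch uses SHA-256 from Python's standard hashlib module. The helper name sha256_hex and the sample inputs are made up for this example.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


# Inputs of very different sizes produce digests of the same length.
short_digest = sha256_hex(b"a")
long_digest = sha256_hex(b"a" * 1000000)

# Changing the input changes the digest.
changed_digest = sha256_hex(b"b")

if __name__ == "__main__":
    print(len(short_digest), len(long_digest))
    print(short_digest != changed_digest)
```

A SHA-256 digest is always 32 bytes (64 hexadecimal characters), no matter how large the input is.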

With that short introduction to cryptography terminology, let’s get to the real meat of what you will do with azbackup: encrypt data.

4.2. Determining the Encryption Technique

The first criterion in picking an encryption technique is to ensure that someone getting access to the raw data on the cloud can’t decrypt it. This means not only do you need a strong algorithm, but also you must keep the key you use to encrypt data away from the cloud. Actual encryption and decryption won’t happen in the cloud; it’ll happen on whichever machine talks to the cloud using your code. By keeping the key in a physically different location, you ensure that an attack on the cloud alone can’t compromise your data.

The second criterion in picking a design is to have different levels of access. In short, you can have machines that are trusted to back up data, but aren’t trusted to read backups.

A common scenario is to have a web server backing up logfiles, so it must have access to a key to encrypt data. However, you don’t trust the web server with the ability to decrypt all your data. To do this, you will use public key cryptography. The public key portion of the key will be used to encrypt data, and the private key will be used to decrypt backups. You can now have the public key on potentially insecure machines doing backups, but keep your private key (which is essentially the keys to the kingdom) close to your chest.

You’ll be using RSA with 2,048-bit keys as the asymmetric key algorithm. There are several other options to choose from (such as ElGamal), but RSA is as good an option as any other, as long as you are careful to use it in the way it was intended. As of this writing, 2,048 bits is the recommended length for keys given current computational power.

Note: Cryptographers claim that 2,048-bit keys are secure until 2030. In comparison, 1,024-bit keys are expected to be cracked by 2011.

Since the archives azbackup works on are typically very large in size, you can’t directly encrypt them using RSA. You’ll be generating a symmetric key unique to every archive (typically called the session key, though there is no session involved here), and using a block cipher to encrypt the actual data using that symmetric key. To do this, you’ll be using AES with 256-bit keys. Again, there are several choices, but AES is widely used, and as of this writing, 256 bits is the optimum key length.
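As a rough sketch of the per-archive key idea, the snippet below draws a fresh 256-bit key from the operating system's random number generator. The constant and function names are hypothetical, and the AES and RSA steps themselves are left out because they need a crypto library.

```python
import os

AES_KEY_BYTES = 256 // 8        # a 256-bit AES key is 32 bytes
RSA_MODULUS_BYTES = 2048 // 8   # a 2,048-bit RSA modulus is 256 bytes


def new_session_key():
    """Return a fresh random key; call this once per archive."""
    return os.urandom(AES_KEY_BYTES)


if __name__ == "__main__":
    key_for_archive_1 = new_session_key()
    key_for_archive_2 = new_session_key()
    # Each archive gets its own key, so the two should never match.
    print(key_for_archive_1 != key_for_archive_2)
    # The session key is much smaller than the RSA modulus, which is
    # why RSA can encrypt it directly even though it can't handle the
    # archive itself.
    print(AES_KEY_BYTES < RSA_MODULUS_BYTES)
```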

Since you will use RSA to encrypt the per-archive key, you might as well use the same algorithm to sign the archives. Signing essentially takes the cryptographic hash of the encrypted data, and then encrypts the hash using an RSA private key. Cryptographers frown on using the same key for both encryption and signing, so you’ll generate a second RSA key pair for signing.
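Because archives can be very large, the hash that gets signed is usually computed in chunks rather than by loading the whole file into memory. The sketch below shows one way to do that with hashlib. The function name, the chunk size, and the choice of SHA-256 are assumptions for illustration, since this section doesn't say which hash azbackup uses.

```python
import hashlib
import io


def digest_stream(stream, chunk_size=64 * 1024):
    """Hash a file-like object piece by piece and return the raw digest."""
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # an empty read means end of stream
            break
        hasher.update(chunk)
    return hasher.digest()


if __name__ == "__main__":
    # Stand-in for an encrypted archive opened in binary mode.
    fake_archive = io.BytesIO(b"encrypted bytes " * 10000)
    print(len(digest_stream(fake_archive)))
```

The resulting digest is what would be handed to the RSA signing step.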

Don’t worry if all of this sounds a bit heavy. The actual code to do all this is quite simple and, more importantly, small.

Note: You might have noticed that the Windows Azure storage account key hasn’t been mentioned anywhere here. Many believe that public key cryptography is actually better for super-sensitive, government-regulated data because no one but you (not even Microsoft) has the key to get at the plaintext version of your data. However, the storage account key does add another layer of defense to your security. If others can’t get access to it, they can’t get your data.

4.3. Generating Keys

Let’s take a look at some code. Earlier, you learned that for the sample application you will be using two RSA keys: one for encrypting session keys for each archive, and one for signing the encrypted data. These keys will be stored in one file, and will be passed on the command line to azbackup. Since you can’t expect the users to have a couple of RSA keys lying around, you will need to provide a utility to generate them.

Since there’s a fair bit of crypto implementation in azbackup, it is bundled together in a module called crypto, with its implementation in crypto.py. You’ll learn about key pieces of code in this module as this discussion progresses.

Example 3 shows the code for the key-generation utility (creatively titled azbackup-gen-key.py). By itself, it isn’t very interesting. All it does is take a command-line parameter (keyfile) for the path where the key file should be written, and then call the crypto module to do the actual key generation.

Example 3. The azbackup-gen-key.py utility

#!/usr/bin/env python
"""
azbackup-gen-key

Generates two 2048 bit RSA keys and stores them in keyfile

Call it like this
azbackup-gen-key.py -k keyfile
"""
import sys
import optparse
import crypto

def main():
    # Parse command-line options
    optp = optparse.OptionParser(__doc__)
    optp.add_option("-k", "--keyfile", action="store",
                    type="string", dest="keyfile", default=None)
    (options, args) = optp.parse_args()

    # Without a key file path there is nothing to do, so show usage
    if options.keyfile is None:
        optp.print_help()
        return

    # The crypto module does the actual key generation
    crypto.generate_rsa_keys(options.keyfile)


if __name__ == '__main__':
    main()

The real work is done by crypto.generate_rsa_keys. The implementation for that method lies in the crypto module. Let’s first see the code in Example 4, and then examine how it works.

Example 4. Crypto generation of RSA keys

import sys

try:
    import M2Crypto
    from M2Crypto import EVP, RSA, BIO
except ImportError:
    print("Couldn't import M2Crypto. Make sure it is installed.")
    sys.exit(-1)

def generate_rsa_keys(keyfile):
    """ Generates two 2048 bit RSA keys and stores them sequentially
     (encryption key first, signing key second) in keyfile
    """
    # Generate the encryption key and store it in bio
    bio = BIO.MemoryBuffer()
    generate_rsa_key_bio(bio)

    # Generate the signing key and append it to the same bio
    generate_rsa_key_bio(bio)

    # Write both keys out; the with block closes the file
    with open(keyfile, 'wb') as key_output:
        key_output.write(bio.read())


def generate_rsa_key_bio(bio, bits=2048, exponent=65537):
    """ Generates an RSA key (2048 bits by default) and writes it to bio.
     Use 65537 as default since the use of 3 might have some weaknesses"""
    def callback(*args):
        # Progress callback passed to RSA.gen_key; intentionally empty
        pass
    keypair = RSA.gen_key(bits, exponent, callback)
    # Passing None as the cipher saves the key unencrypted
    keypair.save_key_bio(bio, None)

If you aren’t familiar with M2Crypto or OpenSSL programming, the code shown in Example 4 probably looks like gobbledygook. The first few lines import the M2Crypto module and import a few specific public classes inside that module. This is wrapped in a try/except exception handler so that you can print a nice error message in case the import fails. This is the best way to check whether M2Crypto is correctly installed on the machine.

The three classes you are importing are EVP, RSA, and BIO. EVP (which is actually an acronym formed from the words “Digital EnVeloPe”) is a high-level interface to all the cipher suites supported by OpenSSL. It essentially provides support for encrypting and decrypting data using a wide range of algorithms. RSA, as the name suggests, is a wrapper around the OpenSSL RSA implementation. This provides support for generating RSA keys and encryption/decryption using RSA. Finally, BIO (which the OpenSSL documentation describes as a “Basic I/O abstraction”) is the I/O abstraction used by OpenSSL. Think of it as the means by which you can send and get byte arrays from OpenSSL.

The action kicks off in generate_rsa_keys. This calls out to generate_rsa_key_bio twice to generate the two RSA public/private key pairs, and then writes them into the key file. A single BIO.MemoryBuffer object is allocated. This is the byte buffer into which generate_rsa_key_bio writes both RSA keys, one after the other.

The key file’s format is fairly trivial. It contains the RSA key pair used for encryption, followed by the RSA key pair used for signing. There is no particular reason to use this order or format. You could just as easily design a different file format or, if you are feeling really evil, you could put the contents in an XML file. Doing it this way keeps things simple and makes it easy to read out the keys again. If you ever need keys of different sizes or types, you will need to revisit this format.
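Reading the keys back out is not shown in this section. As a rough sketch, and assuming the two keys are stored as consecutive PEM blocks (text delimited by BEGIN and END marker lines), the file could be split like this. The function name split_key_file and the regular expression are hypothetical, not part of azbackup.

```python
import re

# Matches one PEM block, from its BEGIN line to the matching END line.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----.*?-----END \1-----", re.DOTALL)


def split_key_file(text):
    """Return (encryption_key_pem, signing_key_pem) from key file text."""
    blocks = [match.group(0) for match in PEM_BLOCK.finditer(text)]
    if len(blocks) != 2:
        raise ValueError("expected 2 PEM blocks, found %d" % len(blocks))
    # The file stores the encryption key first and the signing key second.
    return blocks[0], blocks[1]


if __name__ == "__main__":
    sample = (
        "-----BEGIN RSA PRIVATE KEY-----\nAAAA\n"
        "-----END RSA PRIVATE KEY-----\n"
        "-----BEGIN RSA PRIVATE KEY-----\nBBBB\n"
        "-----END RSA PRIVATE KEY-----\n"
    )
    encryption_pem, signing_pem = split_key_file(sample)
    print("AAAA" in encryption_pem, "BBBB" in signing_pem)
```

Each block could then be handed to the crypto library's own key-loading routine. Relying on position alone is what makes the format fragile if key sizes or types ever change.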

The actual work of generating an RSA public/private key pair is done by generate_rsa_key_bio. This makes a call to RSA.gen_key and specifies a bit length of 2,048 and a public exponent of 65,537. (This is an internal parameter used by the RSA algorithm, typically set to either 3 or 65,537. The code defaults to 65,537 because the use of 3 might have some weaknesses.)

The call to RSA.gen_key takes a long time to complete. The callback function passed in here is empty, but RSA.gen_key calls it with a progress number that can be used to visually indicate progress.
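If you do want visible feedback, the empty callback could be swapped for something like the sketch below. The name make_progress_callback is made up for this example, and because the exact arguments RSA.gen_key passes are not shown here, the callback accepts anything via *args and simply prints a dot per call.

```python
import sys


def make_progress_callback(out=sys.stderr):
    """Build a callback that prints one dot each time it is invoked."""
    state = {"calls": 0}

    def callback(*args):
        # Ignore the arguments; just show that work is still going on.
        state["calls"] += 1
        out.write(".")
        out.flush()

    # Expose the counter so callers can inspect it afterwards.
    callback.state = state
    return callback


if __name__ == "__main__":
    progress = make_progress_callback()
    for step in range(5):      # stand-in for calls made during key generation
        progress(step)
    sys.stderr.write("\n")
```

The returned function would be passed as the third argument to RSA.gen_key in place of the empty callback.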

Why does this take so long? Though this has a bit to do with the complex math involved, most of the time goes into gathering entropy to ensure that the key is sufficiently random. Surprisingly, this process is sped up if there’s activity in the system. The OpenSSL command-line tool asks people to generate keyboard/mouse/disk activity. The key generation needs a source of random data (based on pure entropy), and hardware events are a good source of entropy.

Once the key pair has been generated, it is written out in an encoded text form (the PEM format) into the BIO object passed in.

Note: If you plan to do this in a language other than Python, you don’t have to worry. Everything discussed here is typically a standard part of any mainstream framework. For .NET, use the RSACryptoServiceProvider class to generate RSA keys. Generating the PEM format from .NET is a bit trickier because it isn’t supported out of the box. Of course, you can choose to use some other format, or invent one of your own. If you want to persist with PEM, a quick web search turns up a lot of sample code to export keys in the PEM format. You can also P/Invoke the CryptExportPKCS8Ex function in Crypt32.dll.

Thankfully, all of this work is hidden under the covers. Generating a key file is quite simple. The following command generates a key file containing the two RSA key pairs at d:\foo.key:

d:\book\code\azbackup>python azbackup-gen-key.py --keyfile d:\foo.key

Warning: Remember to keep this key file safely tucked away. If you lose this key file, you can never decrypt your archives, and there is no way to recover the data inside them.
